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Gene regulatory networks arise in all living cells, allowing the control of gene expression patterns. 
The study of their topology has revealed that certain subgraphs of interactions or "motifs" appear 
at anomalously high frequencies. We ask here whether this phenomenon may emerge because of the 
functions carried out by these networks. Given a framework for describing regulatory interactions 
and dynamics, we consider in the space of all regulatory networks those that have a prescribed 
function. Monte Carlo sampling is then used to determine how these functional networks lead to 
specific motif statistics in the interactions. In the case where the regulatory networks are constrained 
to exhibit multi-stability, we find a high frequency of gene pairs that are mutually inhibitory and self- 
activating. In contrast, networks constrained to have periodic gene expression patterns (mimicking 
for instance the cell cycle) have a high frequency of bifan-like motifs involving four genes with at 
least one activating and one inhibitory interaction. 

PACS numbers: 87.16.Yc, 87.18. Cf, 87.17.Aa 



I. INTRODUCTION 

Both natural and artificial networks have unexpected 
properties that may find their origin in the way they were 
constructed. However, another possibility, in particular 
in the context of biological networks, is that constraints 
associated with network functionality are the main deter- 
minants of these unexpected properties. We focus here 
on gene regulatory networks (GRN), the set of interac- 
tions between genes as well as the rules for expression 
dynamics that allow all living cells to control gene their 
expression patterns. In the last decade, gene interactions 
have been measured, modified, engineered, etc., and so 
quite a lot is known about how any given gene can af- 
fect another's expression. Furthermore, small gene net- 
works have been designed to implement simple functions 
in vivo [l], [H , and much larger sets of interactions have 
been reconstructed in a number of organisms [3|-|5[ . From 
these large networks it has been possible to show that 
several "motifs" - subgraphs with given interactions - 
arise far more often than might be expected [6-9]. One 
of the most studied motif is the so called Feed Forward 
Loop or FFL, a graph based on three genes where the first 
regulates the second, and both the first and the second 
regulate the third. Another example is the bifan motif 
in which 2 genes control two others. Biological functions 
have been proposed for these motifs [l(J hl| which give 
them some meaning, but one may ask whether other mo- 
tifs could perform the same functions and what level of 
enrichment might be expected if function were the sole 
cause of motif over-representation. Unfortunately, the 
functions of GRN and the constraints they must sat- 



isfy (e.g., kinetic response characteristics or robustness 
to noise) are still poorly understood, so such questions 
cannot be addressed in a truly realistic framework. In- 
stead, we will (i) work within a plausible model of tran- 
scriptional regulation, (ii) impose functional constraints 
on the patterns of gene expression, (iii) determine which 
motifs emerge when considering the space of all possi- 
ble functional GRN. This particular task is related to 
previous work that used genetic algorithms or simulated 
annealing to desi gn gene tic networks having given func- 
tional properties 12l4l4j|. 



Those studies found that the 
optimization procedures indeed led to particular archi- 
tectures. Our approach differs by not relying on a design 
procedure: we want to get away from any dependence 
on the optimization algorithm and see how functionality 
on its own constrains the possible architectures. In this 
framework, two types of constraints will be applied: we 
will impose either a set of steady-state expression pat- 
terns, or a time periodic pattern of expression motivated 
by previous studies of cell cycling. Interestingly, we find 



very different motifs for these two cases; in Alon's [15 1 
terminology, bifan, diamond and four point cycle motifs 
appear only in the second case. 

Our model of transcriptional regulation is simple 
enough to be used for illustration and, hopefully, for iden- 
tification of the generic features of genetic networks, but 
at the same time is rooted in bio-physical reality to avoid 
ad-hoc assumptions. It significantly extends the frame- 
work of ref. [16j . in particular by allowing for inhibitory 
interactions. We begin by describing our model and then 
show its properties, in particular the kinds of motifs that 
emerge from the functional constraints imposed on the 
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FIG. 1. Each gene's regulatory region contains N binding 
sites, one for each of the N transcription factors produced 
by the N genes. The probability of occupation (POCC [23f| ) 
of the regulatory region determines the average transcription 
rate of the gene i under consideration. 



networks. 



possible competition effects between all molecules nj of 
type TFj. Using the fact that Z in Eq. (JTJ is close to 1, 
it is possible to approximate the occupation probability 
of the binding site by [l9| : 
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As emphasized in [iff], depends strongly on tiy and 
is appreciable only when the mismatch is small, which 
is an a priori unprobable event imposed in functional 
genotypes by the selection pressure. 



II. THE MODEL 
A. Transcription factor binding 

We start with N genes coding for transcription factors 
that may influence each other's expression. To keep the 
model as realistic as possible, we include the known bio- 
physical determinants of transcriptional control. In par- 
ticular, the binding of a transcription factor (TF) to a 
site is described thermodynamically [l7l - tl9l | and depends 
on the mismatch of two character strings of length L, one 
for the TF and one for the binding site. Up to an addi- 
tive constant, the associated free energy in units of /csT is 
taken to be edkj where dij is the number of mismatches, T 
is the temperature and ks is Boltzmann's constant. The 
parameter e is a penalty per mismatch which has been 
measured experimentally to be between one and three if 
each base pair of the DNA is represented by one charac- 
ter [2014221 ] . Also, by comparing to the typical number of 
base pairs found for experimentally studied binding sites, 
one has 10 < L < 15. For all the work presented here, 
we use e = 2 and L = 12, but we have checked that our 
conclusions are not specific to these values. 

We shall define the "interaction strength" Wij from 
gene j to gene i via the Boltzmann factor 

Wij=e- Bd »lZ (1) 

where Z is normalization (a partition function). If there 
were just one transcription factor molecule of type j, Wij 
would be the probability to find that molecule bound 
in gene i's regulatory region. Gerland et al. [l|| have 
shown that in practice Z in Eq. ([T]) is close to 1 and that 
the probability of finding any given TF molecule bound 
rather than unbound is quite low. 

For simplicity, and to prevent different TFs from ac- 
cessing a same site, we use a standardized form of regula- 
tory region for each gene. This situation is illustrated in 
Fig. [2 for gene i which produces the transcription factor 
TFi. The regulatory region of each gene has N bind- 
ing sites, one dedicated to each of the N different TF 
types. Suppose, that there are rij TF molecules of type 
j that can bind to the site j in gene i's regulatory re- 
gion; given that this site can be occupied by only one TF 
molecule at a time, it is necessary to take into account 



B. Transcriptional control 

Again for pedagogical reasons, we shall consider that 
all genes have the same maximal transcription rate; de- 
noting by n the associated maximum number of TF 
molecules in the system of a given type, we shall set 
n j = Sjn where Sj is then the current (normalized) level 
of transcription for gene j, ranging between and 1. Ex- 
perimentally, n is known to range from of order unity to 
many thousands [24H26I ]. Here we shall use n = 1000, 
but again we have checked that using values ten times 
smaller or larger does not change our conclusions. 

The expression Si of gene i will vary with the pres- 
ence of transcription factors bound in its regulatory re- 
gion, but present knowledge does not provide us with 
quantitative information on this dependence. Much past 
modeling work [27l - l3lj | has dealt with this obstacle by 
considering that each occupied binding site provides an 
activating or inhibitory signal and that all signals are 
then added and compared to a threshold: below (respec- 
tively above) this threshold, transcription is off (respec- 
tively on). However, more recent experimental work and 
associated modeling [32], [33[ suggests that transcription 
rates in vivo can exhibit graded responses. This result is 
not surprising given that transcription factors are some- 
times bound and sometimes not, so any average tran- 
scription rate has no reason to be binary. Our work thus 
follows [H, HH, [33[ by considering continuous transcrip- 
tion rates determined by the probabilities that binding 
sites are occupied. 

Consider first the case where all N transcription factors 
affect gene i as activators. If at least one of the binding- 
sites is occupied by its TF, we consider that the gene 
will be transcribed; this choice corresponds to having the 
transcription rate be proportional to the Probability of 
OCCupation or "POCC" [23| of the regulatory region. 
Calling Pi this probability, we have 

Pi = J2 E p^ N - k) (h,-,jk,j k+ i,...,jN) (3) 

fe=i bu-Jk] 

where ji (ji) is the label of an occupied (unoccupied) 
binding site and jk] stands for a combination of k 

out of N gene labels. Assume now that the bindings arise 
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independently (no cooperativity) , i.e. that the probabili- 
ties in the sum factorize into a product of terms P^ 1 ' ' (j) 

(or P/ 0,1) 0')). Replacing Pj h °\j) by P id defined in Eq.H 
we write 

i./i >•./;•• i js) H''^ IT' l'r> (4) 

i j' 

where j runs over indices for which there is binding and 
j' runs over the other indices. Now, the sum over k in 
Eq. [3] can be explicitly performed. In addition, we iden- 
tify up to an overall scale the transcription rate with the 
probability of occupancy Pi. Considering that protein 
content is proportional to transcription rate (at least in 
the steady state), we set Si - the mean normalized ex- 
pression level of gene i - equal to Pi. One finally gets 

^ = 1-IJ(1-^) (5) 

j 

which is the basic equation of the "mean field" model 
of ref. [16]. The neglect of fluctuations and the corre- 
sponding limitations were already explained there. Note 
that if the Pij are small, transcription is additive in these 
variables, while in the binary limit where Pij is or 1, 
Si corresponds to the logic of transcription being "on" if 
and only if at least one of the binding sites is occupied, 
as expected from the use of the POCC. 

Our treatment of inhibitory interactions (due to repres- 
sors) is new and is motivated by a number of known cases 
where the binding of a TF acts as a veto. One way this 
can happen is if the presence of the TF makes the DNA 
form a loop that conceals the other binding sites. An- 
other mechanism for vetoing transcription is simply for 
the bound TF to block the advance of the polymerase. 
Within our framework, transcription proceeds as in Eq.jSJ 
in the absence of such repressors bound to their sites, but 
as soon as any of the inhibitory sites are bound, tran- 
scription is turned off. Assuming cooperative effects are 
absent as before, and repeating for repressors the argu- 
ment just used for activators, we are led to modify Eq. [5] 
to 

S i =[l-J[(l-P ij )}Y[(l-Pif) (6) 

j j' 

where j runs over activating interactions and j' over in- 
hibitory interactions. 

The transcriptional dynamics is then defined as fol- 
lows. Just like in many other modeling frameworks, we 
take time to be discrete (27l43ll |: at each time step we 
first update the in Eq. [5] (using rij — nSj) and then 
update the Si in Eq. [5] These updates are determinis- 
tic, and in general the system goes towards a fixed point 
(corresponding to steady-state expression levels) or to- 
wards a cycle (corresponding to periodic behavior of the 
expressions in time). 

By neglecting cooperative effects, we obtain a toy 
model where the only parameters are those determin- 
ing the binding probabilities implicit in Eq. [2] and these 



are subject to experimental constraints. Incorporating 
cooperative effects could lead to a more realistic model 
but at the cost of more parameters. For instance, one 
could replace in Eq. [4] the equality by a proportionality. 
Such an assumption often appears in the literature: us- 
ing the stationary limit of appropriate kinetic equations, 
one argues that the concentration of a molecular com- 
plex is proportional to the product of concentrations of 
the constituents. Here, because of the reparametrization 
symmetry of the dynamics, the proportionality constant 
can only depend on k. One could then truncate the sum 
over k, say at k = 3, to avoid too many free parameters, 
a situation that arises in a number of genetic network re- 
verse engineering attempts. Such a model deserves study, 
but this is beyond the scope of the present work. 



C. Genotypes and Phenotypes 

As previously mentioned, the TFs and their binding 
sites are associated with character strings. We are in- 
terested in the space of all GRN, which means here all 
possible character strings. However, it is easy to see that 
all choices of TF character strings are equivalent, so we 
can fix them without any loss of generality. (Biologically, 
it is known that TF and most protein coding genes are 
far more conserved than the TF binding sites. See ref. 
jl6j for a discussion of this point.) Any given GRN is 
then completely specified by the N 2 character strings of 
its binding sites and by the specification of the activat- 
ing or inhibiting nature of each interaction. Since DNA 
bases come in four types, A,C,G,T, we use an alphabet 
of four characters for our strings. This set of strings is re- 
ferred to as the "genotype" of the GRN. Clearly the most 
relevant quantities in a genotype are the mismatches dij 
of these TV 2 strings to their TF string. A genotype can 
then usefully be represented by this N by N matrix of 
mismatches or by the corresponding matrix of interaction 
strengths Wij, plus the sign (activating vs. inhibitory) 
associated with each of these interactions. 

At any time step t, the pattern of mean gene expression 
can be represented by the vector S(i) = {Sj (t) }j=i,... jv- 
We shall consider two classes of functions to be imposed 
on our GRN. The first is motivated by cell types in multi- 
cellular organisms: we want the GRN to be able to have 
steady state expression vectors that are very close to 2, 3, 
or more target patterns, each associated with a different 
tissue. Note that some such patterns involving a dozen or 
so genes have been inferred in various organisms |34l [35j j . 
The second kind of function we shall impose is for the 
vector to follow tightly and step by step a sequence of 
patterns that forms a target cycle. Such cases of cycling 
GRN have been studied previously within threshold and 
boolean models [33, Hit . 

For each type of functional constraint imposed, we 
refer to the "phenotype" of the GRN as (i) the differ- 
ent steady-state expression vectors for the first case; (ii) 
the cyclic pattern of expression vectors for the second 
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case. Given a GRN genotype, determining its phenotype 
is straightforward in practice. In the first case where 
we have given target expression patterns, we start in 
these target vectors and we see whether we converge to a 
nearby fixed point under iteration of the transcriptional 
dynamics. (In contrast, in our previous work, we had 
considered initial states that were unrelated to the tar- 
get vector.) In the second case, we start with one of the 
patterns in the target cycle and see whether the trajec- 
tory under iterations stays close to that cycle. For the 
steady state behavior, we shall impose 2, 3, or more vec- 
tors that consist of N/2 levels at and N/2 at 1, and 
furthermore that are taken to be orthogonal (for the 0/1 
coding for Si this means that the scalar product of two 
vectors is N/4). Setting N = 16 (the choice of N is 
not important as long as it has a moderate value, but 
we have not explored what happens at large N), we de- 
fine four mutually orthogonal targets as follows, a direct 
generalization of that of ref. [Ui : 
1111111100000000 
1111000011110000 
1100110011001100 
1010101010101010 

This choice is motivated by the fact that at large N, 
random binary vectors are typically nearly orthogonal. 
The symmetries of the model are relevant for studying 
it, but with any interesting choice of vectors, most of the 
symmetries are broken. Note that since one can trans- 
form these vectors into each other by index permutation, 
the basins of attraction leading to these targets are on 
average of equal size. 

For the case where one enforces a target cycle, we shall 
use the toy sequence where the genes are taken to lie on 
a ring, and the cycle consists in having the "on" genes 
shift to the right at each time step: 
1111111100000000 
0011111111000000 
0000111111110000 
0000001111111100 
0000000011111111 
1100000000111111 
1111000000001111 
1111110000000011 
1111111100000000 

This is reminiscent of the cycle studied by Li et al. [30j ] 
for the yeast cell cycle. We have also considered a cycle 
where the shift is not by two, but by 1 or 3 steps. The 
results are nearly the same in all these cases. 

To quantify the deviations from the ideal target behav- 
ior, we first check whether we have steady-state behavior 
(in the first case) or cyclic behavior (in the second). For 
each target vector s( tarse *), we define its distance to the 
associated GRN specific expression vector S via: 

D(S, S< ta ^ e *)) = I Si - s[ tar9et) | . (7) 

i 

By summing all these distances, one for each target (each 
steady state in the first case, and each expression vector 




FIG. 2. A schematic representation of our MCMC process, 
(a) Steady state behavior and n = 3: crosses (heavy dots) 
stand for the target (fixed point) states, while the line is 
the system's trajectory. The "total" distance entering the 
Metropolis test is Dt = Di + D2 + D3. (b) Similar as before, 
but for a cycle. Grey dots stand for successive states obtained 
by iterating Eq. \S\ Here Dt — D\ + . . . + Dg. 

of the periodic cycle in the second) , we obtain what we 
refer to as the "total" distance Dt for that GRN. The 
resulting measure of "fidelity" to the imposed function 
can be turned into a kind of fitness via 

F(GRN) = exp (-fD T ) (8) 

where / acts as a control parameter allowing one to be 
more or less stringent on the fidelity. We thus consider 
the set of all GRN and apply the relative weight F(GRN) 
to each; this then provides an ensemble for the GRN, 
and by adjusting / we can focus on those GRN that are 
the most functional. For specificity, we shall work with 
/ = 20, but our results depend only very weakly on this 
choice provided / is in the range 10 to 100. 

D. MCMC sampling 

To sample our ensemble of constrained genotypes, we 
apply a Monte Carlo Markov Chain (MCMC) using the 
Metropolis rule. This computer algorithm produces a 
(biased) random walk in the fitness landscape that visits 
at long times the different genotypes according to their 
fitness as given in Eq. [5] the sampling thus focuses on 
genotypes having high fidelity to the imposed functions. 
In detail, we perform random mutations of the binding 
sites. This produces changes to the edge weights and 
thus to the genotypes. (Technically, it would be possible 
to work at the level of edge weights alone, but it would 
make explanations of the MCMC far more delicate.) A 
sweep is defined as LN 2 successively attempted changes 
of the genotype (a random mutation of one coding letter 
and, independently, a random switch of the sign of one of 
the TF-DNA interactions). Each such change is accepted 
or rejected by the Metropolis algorithm. We always make 
a hot start, using as input a completely random GRN. 
After some time, as in ref. [lfi} . we produce a GRN suf- 
ficiently close to the target (see Fig. [5]), and we use this 
to start the production run of the MCMC. Thereafter, 
we iterate sweeps, recording successive GRNs. Unfor- 
tunately, the simulation requires a lot of computation 
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time, especially for L — 12, where small mismatches be- 
come very improbable. Therefore, we resort to the fol- 
lowing modified procedure. We first set L = 8 and gener- 
ate an MCMC sampling, recording genotypes every 100 
sweeps. Since our dynamics depends on mismatches and 
not on the value of L, these recorded GRNs are also fit at 
L = 12, except that the distribution of the magnitudes 
of the mismatches is wrong. Hence we set L = 12 and 
upgrade our GRNs, obtaining in this way a sample of 
fairly independent genotypes in a reasonable time. 

E. Essential interactions and the essential network 

As already mentioned, genotypes can be represented 
by the N by N matrix of entries W^s along with the 
N 2 signs specifying the activating vs. inhibitory types of 
each interaction. These Wij are never zero (cf. Eq. [IJ, 
so we cannot say that an interaction is completely ab- 
sent. Nevertheless, one may expect some interactions to 
be more important than others, for instance when the 
WijS are larger than average. An arbitrary cut-off could 
be introduced for separating small and large values, but 
it is better to base such a classification on functionality. 
We thus consider what happens when an interaction Wij 
is removed by setting it to zero. Starting with one of 
the genotypes generated by our MCMC (and thus typi- 
cally satisfying well the soft functional constraints), we 
determine the change in fitness produced by setting Wij 
to zero: if the change is rejected by Metropolis in five 
successive attempts, we say that this interaction is essen- 
tial, motivated by the corresponding biological definition 
(a very similar result is obtained by defining the essen- 
tiality as the sensitivity to a single deleterious mutation; 
since the definition of essentiality involves the Metropolis 
test, a random event, one sometimes finds false essentials, 
however this is a very weak effect). This definition leads 
to a summary description of a genotype via a list of pairs 
specifying the essential interactions as well as their 
nature (activating or inhibitory). This can then be rep- 
resented by a directed graph, with + signs on the edges 
that are activating and - signs on the edges that are in- 
hibitory. Hereafter we refer to this oriented and signed 
graph as the essential network of the genotype; note that 
no information on the weights of the interactions is at- 
tached to this network representation. 

III. RESULTS 

A. Abundance of functional GRN 

The space of all GRN is finite in our framework since 
each genotype can be specified by N 2 character strings 
of length L and the signs of the associated interactions. 
In this space we impose the soft constraint that a GRN 
implements a function specified by a certain target ex- 
pression behavior. Is such a constraint very stringent? 



To find out, we have generated millions of random geno- 
types and find that none of them have good expression 
behavior: their fitness is orders of magnitude lower than 
what we obtain from our MCMC sampling. Thus, as in 
other gene network models [37j], by focusing on "func- 
tional" GRN, we are considering an extremely small sub- 
set of all GRN; these very rare GRN may thus very well 
be atypical in many of their properties. Nevertheless, as 
long as the number of constraints is not too high (the 
number of steady states or length of the cycle cannot 
grow indefinitely) , the number of high fitness genotypes 
is huge. Indeed, our MCMC is able to produce essen- 
tially as many different GRN as we want even though 
the ensemble of interest is only an infinitesimal part of 
the whole; this feature arises also in other genotype to 
phenotype mapping models such as RNA neutral net- 
works [381 ] . 

We have also checked that the basin of attraction of a 
target phenotype represents a large fraction of all possible 
initial phenotypes, so that the fact of performing the im- 
posed function is not merely a dynamical accident. These 
basins constitute approximately 99.8(9)%, 52.9(9)% and 
49.6(8)% of the whole space for n = 2, 3 and 4 respec- 
tively. As already explained, with our choice of target 
phenotypes, the basins associated with individual targets 
are equal (after averaging over functional GRNs). 



B. Functional GRN have sparse essential networks 

A question that comes to mind is whether essential in- 
teractions are frequent or not. Consider first the case 
of the multi-stability phenotype where we impose 1, 2, 
3, ... steady states. In previous work [l|| on a simpler 
model allowing no inhibitory interactions, we found that 
imposing a single steady state led to sparse essential in- 
teractions, with the great majority of genotypes having 
just one essential interaction (i,j) for each gene i. In 
the present model allowing for inhibitory interactions, 
this property remains true (see Table HI where some of 
our results are summarized). As we impose more steady 
states, the mean number of essential interactions grows; 
again each gene i will typically have just a few essential 
interactions (and almost never none), with a mean of 1.2, 
1.5, 1.9 for 2, 3, and 4 steady states at N — 16. Further- 
more, these means are quite stable as one increases N, so 
the essential networks of our functional genotypes form 
sparse graphs. The mean degree of these networks is in- 
sensitive to N, but grows when the number of constraints 
imposed on the GRN is increased. 

The situation is similar when a cyclic phenotype is im- 
posed: only a small fraction of the interactions turn out 
to be essential. For the toy cases where the genes are on 
a circle and the cycle shifts the "on" genes by steps to 
the right, we find again that the essential networks as- 
sociated with the genotypes of our MCMC ensemble are 
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TABLE I. A sample of results of our simulations. nFP stands 
for phenotypes with n fixed points. The robustness is defined 
as the frequency of genotypes surviving a random mutation 
according to the Metropolis rule. Notice that it is fairly well 
reproduced by 1 — n/2N, a result generalizing an analogous 
result of ref. The distance to the target is the distance 

entering the fitness. It is nearly constant when divided by the 
number of target phenotypes. Further division by the number 
of active genes in a phenotype, i.e. N/2, yields approximately 
6%, which measures the average deviation of their activity 
from the maximum Si = 1. 



observable 


2FP 


3FP 


4FP 


cycle 


(^essentials) 
(^repressors) 
(robustness) 
(dist2target) 


14.70(1) 
2.500(12) 
0.9494(2) 
0.966(3) 


21.08(2) 
6.38(2) 
0.9259(2) 
1.390(4) 


29.39(1) 
10.11(1) 
0.8998(3) 
1.961(4) 


23.42(1) 
7.38(1) 
0.9099(3) 
3.583(4) 



sparse, and that the connectivity hardly changes as one 
increases the number of genes. 

There is a simple explanation for this sparseness: at 
the level of the GRN, introducing an additional essential 
interaction generally means increasing a Wij. That has 
a high entropy cost as can be seen from the mismatches 
(there are few strings that have a low mismatch, and 
many that have a high mismatch). On the contrary, if 
one were to consider a Boolean model at the level of the 
essential network (to go from genotypes to phenotypes) 
and ignoring the molecular basis of the interactions, one 
would inevitably have far more functional graphs with 
dense interactions than with sparse interactions; sparse- 
ness would then have to be enforced in an ad-hoc way 
since biological networks are indeed sparse experimen- 
tally [H, So| . 



C. Functional essential networks have 
parsimonious inhibitory interactions 

Are inhibitory interactions as frequent as activating 
ones? The answer to this question depends on the inter- 
actions considered. Indeed, even though the genotypes 
generated by the MCMC sampling have functional con- 
straints, the many small Wij arising in genotypes have 
hardly any effect on the phenotype; their sign will thus 
be random, and in effect they act like noise. If instead 
we focus on the larger Wij , the functional constraints are 
likely to bias the sign in favor of activating interactions. 
To avoid an arbitrary definition of large weights, we again 
use the notion of essentiality because of its link with phe- 
notypes. For the essential networks produced from the 
GRN of our MCMC with the constraint of 2 to 4 steady 
states, we find that the great majority of the interactions 
are activating, cf. Table U (these numbers are not sensi- 
tive to N). These results are not surprizing: increasing 



the number of constraints forces the connections to be 
more complex and to make greater use of inhibitory in- 
teractions. In the toy cases of genes on a ring, we also 
see this general picture and find that the number of both 
activating and inhibitory essential interactions grows lin- 
early with N. 



D. Abundance of functional essential networks 

Another question of interest concerns the number of 
distinct functional essential networks (the number of dis- 
tinct GRNs is of little interest, being trivially enormous 
since all inessential interactions can be changed at will 
without affecting the phenotype). It is wise to first find 
the essential networks that are in a sense representative 
for a group of GRNs, in other words to perform a clus- 
ter analysis of the sample of essential networks at our 
disposal. Let the numbers of such networks be M and 
define a distance between a pair of them, for example 

Distance^, B) = ^(Ay - B^f (9) 

y 

where Ay (viz. Bij) is ±1 for essential interactions and 
otherwise. Our question can now be reformulated more 
precisely: does the number of clusters, considered a proxy 
for the number of representative essential networks, sat- 
urate at some moderate value as M or not? (It must 
saturate somewhere, of course, but that may be at very 
large values.) To answer this, it is most convenient to use 
the modern affinity propagation algorithm |4lj , where the 
number of clusters is not preassigned but is determined 
by the algorithm; the code can be downloaded from 
www.psi.toronto.edu/index.php?q=affinity propagation. 
As an illustration, for 4 fixed points we find that the 
number of clusters grows at large M roughly like M 2 / 3 
(with a prefactor of the order of 0.3) and shows no sign 
of saturation up to at least M = 4000. Other values of n 
lead to similar results, but some care is necessary in in- 
terpreting these trends at n — 2 and 3. Indeed, it turns 
out that for these values of n many clusters, distinct ac- 
cording to eq. (9), have essentially the same topology 
and differ merely by the labeling of nodes (this reflects 
symmetries in our choice of the target phenotypes). In 
contrast, for n = 4 the clusters are genuinely different. 
To get more insight into this problem, we have carried 
out a complementary investigation, counting the number 
of distinct topologies (instead of using the clustering al- 
gorithm). This is very tedious and our account of the 
network reparametrizations was only partial. With this 
proviso, it appears that the number of distinct topolo- 
gies again increases like a power of M, however now the 
exponent increases with n (approximately from 0.69 for 
for n = 2 to 0.97 for n = 4). 

Beyond clusters, we can also ask which essential net- 
work topologies are the most frequent. In Fig. [3] we dis- 
play the most frequent topology when imposing n = 2 
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FIG. 3. The most common essential network topologies when 
n steady-state expression patterns are imposed. In each case, 
we see the presence of the motif with two mutually inhibiting 
and self-activating genes. Interactions shown are essential, 
and those genes whose target expression is the same in all the 
steady states are omitted since they provide no information. 
(Data for N = 16; sub-figures a and b are for n — 2 and 4. 
These topologies arise for over 20% and less than 1% of the 
networks, respectively). 

and 4 fixed points. As n increases, the connections be- 
comes more complex as expected; in particular at small 
n much of the topology is tree-like; to a large extent, this 
reflects the sparseness and parsimony of these essential 
networks. Note that our GRN are connected by a suc- 
cession of point mutations. This shows that the same 
function can be performed by a very large number of dis- 
tinct networks, a feature also found in other models of 
gene networks [37J , but here we show that many different 
topologies arise too. 

E. Functionality leads to motif selection 

Working with the full description of genotypes is cum- 
bersome and difficult, whereas focusing just on essential 
networks provides a great deal of intuition, in particu- 
lar for what features are relevant for functionality. The 



FIG. 4. The most common essential network topology when 
one imposes a particular cyclic expression pattern. That pat- 
tern corresponds to a group of active genes that shifts clock- 
wise at each step. One sees very clearly the activating in- 
teractions acting downstream and the inhibitory interactions 
acting upstream. (Data for N = 16; this topology arises for 
over 15% of the networks.) 



price to pay for this simplicity is some loss of information; 
for example, two interactions separately may be non- 
essential but nevertheless if one removes both of them 
the network's functionality may be lost. 

To obtain insights into network structure, one can 
search for network motifs; this has become very popu- 
lar in recent years, to a large extent through the effort 
of U. Alon and collaborators [HI, |42[ (see also the web 
page www.weizmann.ac.il/mcb/UriAlon/). The fact that 
a complex network can be constructed from small stan- 
dard sub-elements is by itself not surprising. For ex- 
ample, this property is at the root of electronics and is 
based on the mathematical structure of logical functions. 
However, the fact that nature also uses this strategy is 
not obvious, and that some motifs and not others are 
employed in different network functions is even less obvi- 
ous. This presence of motifs is revealed though from the 
detailed studies of (rather rare, for obvious reasons) bio- 
logical networks reconstructed from data, and it has been 
partly explained by arguments borrowed from communi- 
cation systems techniques. This brings us to inquire what 
happens in a model where the same dynamics is always 
at work, and where thousands of networks can be gener- 
ated for several network functions: will the same motifs 
emerge when the functionality constraints are modified, 
or will the motifs change with the functions implemented 
by the networks. 

To answer the previous question, we determine the 
motifs in our different ensembles. The web page men- 



TABLE II. Most important motifs. nFP means "n fixed 
points phenotype" . Cycle refers to our 8-step cycle. We typ- 
ically used 1000 GRNs in our motif search. 



motif 


2FP 


3FP 


4FP 


cycle 


motif a: model 
randomized 


0.706(16) 

0.002(1) 


2.358(39) 

0.002(1) 


2.984(4) 

0.002(2) 


0.000 
0.000 


motif b: model 
randomized 


0.000 
0.001(1) 


0.000 
0.008(3) 


0.000 
0.071(9) 


5.451(41) 

0.023(5) 


motif c: model 
randomized 


0.000 
0.000 


0.000 
0.008(3) 


0.000 
0.078(9) 


5.170(40) 

0.029(1) 


motif d: model 
randomized 


0.000 
0.000 


0.000 
0.014(4) 


0.000 
0.0180(6) 


4.533(42) 

0.030(7) 


motif e: model 
randomized 


0.000 
0.017(4) 


0.000 
0.102(16) 


0.000 
0.057(8) 


6.676(22) 

0.173(14) 


motif f: 
randomized 


0.000 
0.000 


0.000 
0.001(1) 


0.000 
0.050(6) 


2.296(29) 

0.003(2) 



tioned above offers a software for motif search; it is not 
quite adapted to our needs, since it does not distinguish 
between activators and repressors, and does not accept 
self-interactions. However, it was helpful in this work, 
enabling us to single out the relevant motif topologies 
(when a topology is irrelevant, it is also so when more de- 
tailed distinctions are introduced). Furthermore, we used 
it to test our own codes for motif extraction. The results 
presented here concern the most prominent motifs; oth- 
ers have frequencies that are either very small or at least 
roughly of the order of the expectation for a random- 
ized network. We discard motifs with leaves (degree-one 
nodes), which are somewhat trivial. We keep only mo- 
tifs that are not a subgraph of a larger motif with the 
same number of nodes. However, our motifs can partly 
overlap. The randomization used is that proposed by 
Maslov and Sneppen [43[: the links are interchanged, so 
that both the in- and out-degrees of network nodes re- 
main unchanged. Our results are summarized in Table ITT1 
and the motifs are listed and defined in Fig. [SJ 

We see right away a very strong dichotomy: the motifs 
are very different for our two classes of functions (im- 
posing multi-stability vs. cycling). In the case of multi- 
stability, one single motif stands out as being extremely 
important: two genes that are mutually inhibitory and 
which are self-activating. Clearly such a pair of genes can 
act as a switch that will then influence downstream genes 
according to the expression pattern that is required for 
the considered fixed point. When dealing with more than 
two target fixed points, additional such motifs should be 
necessary. That is indeed what we saw in Fig. [3] which 
displayed the most represented essential network (ignor- 
ing permutations of indices) for 2 and 4 imposed fixed 



8 




FIG. 5. The most prominent motifs found for our two 
classes of functional constraints. Case of multi-stability (more 
than one steady-state): (a) two mutually inhibiting and self- 
activating genes. Case of expression targets that are cyclic in 
time: (b,c) diamond motif, (d,e) four-node loop, (f) bifan. 

points. Interestingly, the same trend also emerges for 
the less frequent essential networks (data not shown). 
Roughly, the networks display a core of central genes 
that belong to a motif of type "a" (using the nomencla- 
ture shown in Fig. [5]) and these genes then influence other 
genes by a simple downstream effect along the associated 
tree-like graph of activating interactions. 

Now when we look at the motifs present when impos- 
ing cyclic expression targets, the previous motif is ab- 
sent and instead we have several four gene motifs that 
are strongly over-represented. Motif "f" is the bifan 
in the nomenclature of Alon; the others generally in- 
volve a regulatory loop, and in particular motifs "d" 
and "e" are "frustrated" in that they have loops with 
an odd number of inhibitory interactions. None of these 
were over-represented in the networks satisfying multi- 
stability. Their presence can be understood by looking at 
the dominant essential networks such as in Fig. 01 Note 
that these motifs are analogous to the ones driving re- 
pressilators [44{ but they involve more genes and do not 
provide oscillatory behavior on their own, their function 
requires the presence of the other genes that are not in- 
cluded in the motif construction. 



IV. DISCUSSION AND CONCLUSIONS 

The central question tackled by the present work is 
whether the emergence of motifs in gene regulatory net- 
works can be due to functional constraints. Given the 
uncertainties in how real genetic networks function, we 
have taken a modeling route and have addressed this 
question in silico. Our model incorporates known molec- 
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ular mechanisms for the description of genetic interac- 
tions, and in fact the only parameters in our model come 
from parametrizing the affinities of transcription factors 
to their binding sites. Furthermore, in contrast to most 
other gene regulatory network modelings, the associated 
interactions are never completely absent; they can be im- 
portant or unimportant for the functionality of the net- 
work, a notion we characterized by the essentiality of 
interactions. Finally, the expression level of each gene 
follows dynamics allowing for continuous values; this ad- 
ditional complexity compared to using digital "on-off" 
expression levels forces one to consider functionality as 
a soft constraint, imposing expression levels to be "suffi- 
ciently" close to target patterns. Network functionality is 
then quantified via a fitness measure. Such a framework 
provides a close parallel with thermodynamic ensembles; 
all questions are then necessarily posed in a probabilistic 
framework where each network arises with a probability 
proportional to its fitness. In practice, we explore the 
corresponding ensemble of genetic networks numerically, 
using Monte Carlo Markov Chain. 

Two types of gene network functionalities have been 
studied. The first is motivated by the different cell types 
in multi-cellular organisms and is implemented by con- 
straining the genes in the networks to have steady-state 
expression levels close to given target levels; in effect, the 
transcriptional dynamics of the networks must allow for 
multi-stability, that is multiple fixed points of the expres- 
sion dynamics. The second type of functional constraint 
considered is motivated by previous work on the cell cy- 
cle; we implemented such a constraint by forcing the net- 
works to have their expression levels follow a given cyclic 
pattern in time. Thus instead of fixed points, in this case 
we ask for a periodic behavior of the dynamics. In both 
cases, we found characteristic features shared with other 
models of living systems [38| as follows. (1) The con- 
straints imposed are extremely stringent as can be seen 
from the fact that in practice they are never satisfied by 
randomly generated networks. (2) Although the fraction 
of networks of interest is tiny, the number of networks 
satisfying the constraints is astronomical as revealed by 
our Monte Carlo Markov Chain sampling. 

Of interest is the structure of these presumably atypi- 
cal networks. Particular architectures are known to arise 
when performing genetic network design via optimiza- 
tion algorithms [12l - ll4l |. Is this property a bias of these 
algorithms or does it reflect an underlying constraint im- 
posed by network function? It is difficult to tackle this 
question head-on except in very small systems; there one 
can explore all possible values for the model's parame- 
ters [451 ] and see the functional consequences. Since mo- 
tifs can involve three or even more genes inside a larger 
network, a different approach is necessary for moderate 
and large networks. The most adapted tool is based on 
Monte Carlo Markov chains and so we have applied this 
approach to our systems with up to 16 genes. MCMC 



then allows us to sample the space of functional networks 
in spite of the fact that it represents only a tiny fraction 
of the space of all networks. 

Given a gene regulatory network produced by the 
Monte Carlo algorithm, we first extracted the essential 
interactions to obtain what we called the essential net- 
works. This representation gets rid of irrelevant interac- 
tions that are too small to influence much the functional- 
ity. Interestingly, these essential networks are sparse and 
make use of inhibitory interactions parsimoniously. We 
then determined the motifs appearing in these essential 
networks, where a motif is an oriented sub-graph that 
is overly frequent when comparing with a randomization 
test preserving each node's degree. In the case of net- 
works satisfying the multi-stability constraints, we found 
one very dominant motif of two genes acting as a switch: 
each gene represses the other while activating itself. Fur- 
thermore, this motif arose once when imposing two fixed 
points to the dynamics, twice when imposing three fixed 
points to the dynamics etc. This pattern makes good 
sense from a "design" perspective: the choice to go to one 
fixed point rather than to another can be implemented 
most simply by using switches that operate in this logi- 
cal fashion. Moving on now to the ensemble of networks 
that implemented expression patterns that were cyclic in 
time, we found here that the dominant motifs involved 
4 genes as shown in Fig. [5] One of these corresponds to 
the bifan in Alon's nomenclature, but four other motifs 
were also found and in fact were even more often present. 
All of these motifs involve at least one inhibitory inter- 
action; this is appropriate for our imposed cycle as the 
newly turned on genes must at some point turn off the 
other genes they are replacing. Interestingly, the motifs 
we find in one ensemble are not present in the other. This 
shows that functionality is a major determinant of the 
content in motifs, at least within our simplified frame- 
work. Some importance of functionality could have been 
expected a priori, but the size of the effect is striking. 
We hope this result will encourage the search for func- 
tional biases between experimental motifs, in particular 
through comparative studies. 
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